Integrating BERT Embeddings and BiLSTM for Emotion Analysis of Dialogue

Dialogue systems are an important application of natural language processing in human-computer interaction. Emotion analysis of dialogue aims to classify the emotion of each utterance in a dialogue and is crucially important to dialogue systems: it supports semantic understanding and response generation and is of great significance to practical applications such as customer service quality inspection, intelligent customer service systems, and chatbots. However, short texts, synonyms, neologisms, and reversed word order make emotion analysis in dialogue challenging. In this paper, we argue that modeling dialogue utterances along different feature dimensions enables more accurate emotion analysis. Based on this, we propose a model in which BERT (bidirectional encoder representations from transformers) generates word-level and sentence-level vectors; the word-level vectors are processed by a BiLSTM (bidirectional long short-term memory), which better captures bidirectional semantic dependencies, and the result is concatenated with the sentence-level vector and fed into a linear layer to determine the emotions in the dialogue. Experimental results on two real dialogue datasets show that the proposed method significantly outperforms the baselines.


Introduction
With the continuous progress of science and technology, more and more consumers use intelligent products. To improve the overall performance of intelligent products and the user experience, human-computer interaction has become a top priority in research on intelligent products. Dialogue systems are an important application of natural language processing in human-computer interaction, spanning both task-oriented systems, such as intelligent customer service, and open-domain systems, such as chatbots. In addition to generating appropriate responses, these systems must also attend to the subjective feelings of users to give them a good conversational experience [1].
Emotion analysis in dialogue systems aims to classify the emotion of each utterance, i.e., the attitudes, opinions, and emotional tendencies expressed during the conversation [2]. Emotion analysis is crucially important to dialogue systems and contributes to a good user experience. For example, when a user tells an intelligent customer service system, "I bought it last week, and it's broken," the user both describes an objective fact about a product defect and expresses dissatisfaction with the product through an angry tone. It follows that emotion analysis of dialogue is of great significance to practical applications such as customer service quality inspection, intelligent customer service systems, and chatbots.
Emotion classification methods include dictionary-based models, machine learning models, and deep learning models. Dictionary-based models use the polarity and intensity values of emotion terms, the intensity values of degree terms, and the values of negation terms to classify the emotion of sentences. However, they depend on emotion dictionaries, whose construction is labor-intensive and time-consuming, and dictionaries covering multiple emotions are especially difficult to build. Machine learning models each have advantages and disadvantages in particular situations. In machine learning methods, a sentence emotion classifier is trained on a large number of sentences with emotion labels and then predicts the emotion of new sentences. These methods mainly include k-nearest neighbors, Naive Bayes [3], decision trees, and support vector machines [4], all widely used in emotion classification. However, machine learning methods require the manual construction of a feature model, which is inefficient and time-consuming.
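As a minimal sketch of such a classical machine learning pipeline (scikit-learn and the toy training data are our illustrative assumptions, not the paper's setup), a hand-crafted TF-IDF feature model feeds a linear support vector machine:

# A hedged sketch of a feature-engineered emotion classifier; the library
# (scikit-learn) and training data are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_sentences = ["I bought it last week, and it's broken",
                   "Thanks a lot, that solved my problem!"]
train_labels = ["anger", "joy"]  # hypothetical emotion labels

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_sentences, train_labels)
print(clf.predict(["this product is terrible"]))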
Existing deep learning models for emotion analysis mainly use word vectors and are based on recurrent neural networks. Deep learning architectures surpass machine learning methods in accuracy while keeping complexity low [5]. However, these neural network models feed only word-level vectors into the network to predict emotion; sentence-level vectors are neither considered nor explored. As a result, local and global information are not captured completely. In addition, although emotion analysis of utterances is very important in dialogue systems, no existing dialogue emotion analysis is based on BERT embeddings and BiLSTM.
Building on existing research, we propose a combination of the BERT (bidirectional encoder representations from transformers) model and the BiLSTM (bidirectional long short-term memory) model to capture global and local information for emotion analysis. The major contributions of this paper are summarized as follows:
(i) An architecture based on BERT embeddings and BiLSTM is proposed and constructed to determine emotions in dialogue.
(ii) As a feature extraction model, BERT produces word-level and sentence-level vectors from the input data and is embedded into the neural network architecture; the word-level vectors are then processed by BiLSTM and concatenated with the sentence-level vectors for emotion analysis.
(iii) To evaluate the proposed emotion classification model, experiments are conducted on two real dialogue datasets. The experimental results show that the proposed method significantly outperforms the baselines.
To the best of our knowledge, this is the first method to integrate BERT embeddings and BiLSTM for emotion analysis of dialogue. The rest of this paper is organized as follows. Related works are introduced in the second section. The third section details the emotion classification model based on BERT embeddings and BiLSTM. Next, the experimental setup is described, and the results are analyzed against the baselines. The last section summarizes the research.

Related Works
In this section, we review related works on feature vectorization and LSTM in detail. After analyzing their limitations, we present a method integrating BERT embeddings and BiLSTM for emotion analysis of dialogue that addresses these limitations.

Feature Vectorization.
The vectorization of text features is a key part of classification tasks. In general, words are mapped into a unified vector space. One-hot representation is the simplest feature vectorization method. However, like other traditional rule-based or statistics-based natural language processing methods, it treats each word as an atomic symbol and neglects the semantic relationships between words. In addition, one-hot representation yields feature vectors of high dimensionality, which increases computational complexity and hurts subsequent classification. Latent semantic analysis [6], probabilistic latent semantic analysis [7], and latent Dirichlet allocation [8] produce distributed representations that can be used for text similarity calculation [9] and text classification [10], but these models also neglect the semantic relationships between words [11]. Word embeddings contain more information, mapping each word into a distributed representation in which each dimension of the vector space has a specific meaning. Many models generate word embeddings. Word2Vec is a lightweight neural network consisting only of an input layer, a hidden layer, and an output layer. The Word2Vec framework mainly includes the CBOW and skip-gram models, which differ in their inputs and outputs [12]: skip-gram learns a word's semantic vector by predicting its surrounding context, while CBOW learns to predict a word from its context.
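As a minimal sketch of the two Word2Vec training modes (the gensim library and the toy corpus are our assumptions; the paper names no implementation):

# Training Word2Vec embeddings; sg=1 selects skip-gram, sg=0 selects CBOW.
from gensim.models import Word2Vec

corpus = [["i", "bought", "it", "last", "week"],
          ["it", "is", "broken"]]
model = Word2Vec(corpus, vector_size=100, window=2, min_count=1, sg=1)
print(model.wv["broken"].shape)  # a 100-dimensional distributed embedding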
Bidirectional encoder representations from transformers (BERT) is a self-encoding pretrained language representation model [13]. BERT is designed around two pretraining tasks. The first randomly selects some words, replaces them with a special symbol ([MASK]), and trains the model to fill in these positions according to the given labels. The second adds a sentence-level prediction of whether two sentences are continuous, so that the model learns relationships between consecutive segments of text. Various applications use BERT to vectorize text features [14-16]. In sentiment classification, a Chinese sentiment classification model based on pretrained BERT extracts abstract text features of single Chinese characters from contextual semantic relationships [17]. A long-text classification method for Chinese news uses the BERT pretrained language model to build sentence-level feature vectors of a news text and captures global features with an attention mechanism that identifies correlated words in the text [18]. A novel BERT-based framework demonstrates the enhanced performance obtainable by combining latent topics with contextual BERT embeddings [19]. A framework based on BERT and a CNN with an attention mechanism is used for sentiment classification of microblogs [2]. In the financial field, BERT and CNN are combined to classify candidate causal sentences [20]. In summary, BERT has excellent performance and is widely used in many fields of text classification.
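The masked-language-model pretraining task can be illustrated with a short sketch (the HuggingFace transformers library and model checkpoint are our assumptions, not the paper's code):

# BERT fills in a [MASK]ed position from bidirectional context.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("I bought it last week, and it's [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))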

LSTM.
Long short-term memory (LSTM) networks are a special kind of RNN designed to solve the long-term dependency problem [21]. Both traditional RNNs and LSTMs transmit information only from front to back, which is limiting in many tasks. In POS tagging, for example, a word's part of speech depends not only on the preceding words but also on the following ones. To address this, the bidirectional long short-term memory (BiLSTM) network was proposed; it is composed of two LSTM networks. The idea is to feed the same input sequence to a forward LSTM and a backward LSTM, respectively, and then connect the hidden layers of the two networks to the output layer for prediction. BiLSTM already has a variety of applications: BiLSTM-based systems can translate languages [22], summarize documents [23], recognize speech [24], drive dialogue systems [25], predict disease [26], and so on. In sentiment classification, an integrated BiLSTM, BiGRU, and CNN model has been proposed [5], as has a hybrid model based on BERT, BiLSTM, and a text convolutional neural network [27]. In the legal area, a shallow network with one BiLSTM layer and one attention layer performs Portuguese legal text classification [28]. ABLG-CNN, an attention-based BiLSTM fused with a gated CNN for Chinese long-text classification, uses the attention mechanism to compute context vectors of words and derive keyword information; its BiLSTM captures context features, and its CNN captures topic-salient features [29].
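A minimal PyTorch sketch of a bidirectional LSTM (the dimensions are assumptions; 768 matches BERT-base outputs and 384 matches the hidden dimension used later in the experiments):

# Forward and backward hidden states are concatenated per position.
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=768, hidden_size=384,
                 batch_first=True, bidirectional=True)
x = torch.randn(32, 100, 768)   # (batch, sequence length, features)
out, _ = bilstm(x)              # out has shape (32, 100, 2 * 384)
print(out.shape)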
However, existing research on emotion analysis has not used BERT word-level and sentence-level embeddings to extract local and global information from text. In addition, although emotion analysis of utterances is very important in dialogue systems, no dialogue emotion analysis is based on BERT embeddings and BiLSTM. In this study, an architecture based on BERT embeddings and BiLSTM is proposed and constructed to analyze emotions in dialogue. Its details are described in the following methods section.

Methods
Figure 1 shows the research framework of emotion classification of dialogue based on the BERT embeddings-BiLSTM model. First, the dialogue data are put into a BERT embedding processor to generate word-level and sentence-level vectors. Then, the word-level features are processed by BiLSTM. Finally, the processed word-level vectors and the sentence-level vectors are concatenated and fed into a linear layer for emotion analysis of dialogue.

BERT Embedding Processor.
This paper selects BERT for its strong feature representation ability. In the BERT embedding processor, the text of the dialogue is converted into word vectors and a sentence vector. The sentence vector, taken from the output of the penultimate layer, represents the semantic features of the sentence; it is the pooled output associated with the special classification token ([CLS]), whose final hidden state is used as the aggregate sequence representation for classification tasks. The word vectors are the sequence output, corresponding to the last hidden states of all the words in the sequence. Sentence-level features can represent the original semantics of the whole sentence without further processing; the word-level features, once processed by BiLSTM, reveal the dependency relationships between words, which is a good complement to sentence-level feature modeling.
In this section, the sentences of a dialogue S are input to the BERT embedding processor, which generates word vectors X = {x_1, x_2, ..., x_m} and a sentence vector y, where m is the maximum sequence length of a sentence. The word vectors X are sent to BiLSTM for further processing.
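A minimal sketch of this embedding processor (the HuggingFace transformers API and the bert-base-uncased checkpoint are our assumptions; the paper does not name its implementation):

# Produce word-level vectors X and a sentence-level vector y from BERT.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

enc = tokenizer("I bought it last week, and it's broken",
                padding="max_length", max_length=100,
                truncation=True, return_tensors="pt")
with torch.no_grad():
    out = bert(**enc)

X = out.last_hidden_state  # word vectors, shape (1, 100, 768)
y = out.pooler_output      # pooled [CLS] sentence vector, shape (1, 768)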

BiLSTM.
On the basis of the word vectors X, the BiLSTM model further learns X′, which captures the long-distance dependencies among the word vectors of a dialogue sentence. The processed feature X′ is then combined with the sentence-level feature y, which better represents the dialogue sentence. The BiLSTM neural network consists of two independent LSTMs. The input sequence is fed into the two LSTM networks in forward and reverse order, respectively, for feature processing. The concatenation of the two output vectors forms the processed word vectors X′, which serve as the final feature expression of the words of the sentence.
LSTM consists of three gates: a forget gate, an input gate, and an output gate. The forget gate controls how much information from the previous unit is retained and determines what is discarded. The input gate controls the proportion of input information added to the cell state. The output gate controls the update of the current memory state and the output of the hidden layer. The LSTM neural network is shown in Figure 2.
In the LSTM neural network model, the forget gate chooses the historical information retained by the cell state, the input gate controls the proportion of new input information saved to the cell state, and the output gate determines the final output information. The mathematical modeling of the LSTM unit at the t-th position is shown in the following equations:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$c_t' = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot c_t'$
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t \odot \tanh(c_t)$

where f_t, i_t, and o_t denote the forget gate, input gate, and output gate, respectively; h_{t-1}, W, and b are the output of the previous hidden layer state and the weights and biases of the gate neurons; x_t is the input at the t-th position; $\sigma$ is the sigmoid function; and c_t' and c_t are the candidate cell state and the cell state. LSTM was proposed to overcome short-term memory problems by introducing internal gate mechanisms that regulate the flow of information. However, the LSTM model only encodes information from front to back and cannot capture comprehensive semantic information. BiLSTM is in effect two LSTMs and can better capture bidirectional semantics: one LSTM processes the sequence in the forward direction, the other processes it in the reverse direction, and their outputs are combined. The BiLSTM model is used to learn the dependencies between words in sentences and to combine them with the sentence-level features so as to express the sentence features more fully. With BiLSTM enhancing the word-vector features, the state h_t at the t-th position is

$h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t}$

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the forward and backward hidden layer states, respectively, and $\oplus$ denotes concatenation.
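To make the gate equations concrete, here is a single LSTM step written out with explicit tensors (a sketch with assumed dimensions, not the paper's code):

# One LSTM step implementing the gate equations above.
import torch

d_in, d_h = 768, 384
x_t = torch.randn(d_in)      # input at position t
h_prev = torch.zeros(d_h)    # h_{t-1}
c_prev = torch.zeros(d_h)    # c_{t-1}
W = {g: torch.randn(d_h, d_in + d_h) for g in "fico"}  # W_f, W_i, W_c, W_o
b = {g: torch.zeros(d_h) for g in "fico"}

z = torch.cat([h_prev, x_t])                  # [h_{t-1}, x_t]
f_t = torch.sigmoid(W["f"] @ z + b["f"])      # forget gate
i_t = torch.sigmoid(W["i"] @ z + b["i"])      # input gate
c_cand = torch.tanh(W["c"] @ z + b["c"])      # candidate cell state c_t'
c_t = f_t * c_prev + i_t * c_cand             # new cell state
o_t = torch.sigmoid(W["o"] @ z + b["o"])      # output gate
h_t = o_t * torch.tanh(c_t)                   # new hidden state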

Dropout and Linear Layers.
After feature processing in BiLSTM, the processed word vectors X′ are generated. A dropout layer randomly sets input elements to zero with probability 50% to prevent overfitting. Finally, X′ and y are concatenated and fed into the linear layer to generate the emotion F, which is used to conduct emotion analysis of dialogue.
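A sketch of this classification head (the dimensions are assumptions, and the paper does not specify how the BiLSTM sequence output is reduced to a single vector per utterance; a pooled vector is assumed here):

# Dropout plus a linear layer mapping [X'; y] to emotion logits F.
import torch
import torch.nn as nn

num_emotions = 7                     # both datasets label seven emotions
x_prime = torch.randn(32, 2 * 384)   # pooled BiLSTM word-level feature X'
y = torch.randn(32, 768)             # BERT sentence-level vector y

head = nn.Sequential(nn.Dropout(p=0.5),
                     nn.Linear(2 * 384 + 768, num_emotions))
F_logits = head(torch.cat([x_prime, y], dim=-1))  # shape (32, 7)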

Experimental Results and Analysis
In this section, experiments are conducted on two real datasets to verify the effectiveness of the proposed method. Our method is compared with state-of-the-art emotion classification models.

Experiment Environment and Dataset.
The experiments are executed on a server with an RTX 3090 GPU (24 GB of video RAM) and a 32-core AMD EPYC 7543 processor. The server runs the Ubuntu 18.04 (64-bit) operating system, PyTorch 1.9.0 with GPU support, and Python 3.8.
The first dataset is the Multimodal EmotionLines Dataset (MELD) [30], which includes text, audio, and video information. MELD contains more than 1,400 dialogues, totaling 13,000 utterances from the TV series Friends. We use its text data as the first experimental dataset; it covers seven emotions, namely anger, disgust, sadness, joy, neutral, surprise, and fear.
The second dataset is DailyDialog [31], a high-quality multi-turn dialogue dataset containing only text data with low noise; it reflects a variety of daily-life topics and has no fixed speakers. DailyDialog also covers seven emotions: neutral, happiness, surprise, sadness, anger, disgust, and fear. It comprises 12,218 conversations with 103,607 sentences, a large data scale.

Evaluation Metrics.
To evaluate the effectiveness of the proposed method, precision and recall are used as statistical measures for the classification models. In general, the higher the precision and recall, the better the classification model. Precision is computed as

$\text{Precision} = \dfrac{TP}{TP + FP}$

where TP is the number of true positive samples and FP is the number of false positive samples.
Recall is computed as

$\text{Recall} = \dfrac{TP}{TP + FN}$

where FN is the number of false negative samples. However, precision and recall sometimes contradict each other. The most common remedy is the F1-score, which comprehensively combines the results of precision and recall; a higher F1-score indicates a more effective classification model. The F1-score is computed as

$F1 = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

In this paper, we report the weighted average precision, recall, and F1-score, which weight each class by its share of samples among all classes. In addition, FLOPs and params denote the number of floating-point operations and the number of parameters required by each network model.
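A short sketch of computing these weighted-average metrics (scikit-learn and the toy labels are our assumptions):

# Weighted-average precision, recall, and F1-score over toy predictions.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["anger", "joy", "neutral", "joy", "sadness"]
y_pred = ["anger", "neutral", "neutral", "joy", "anger"]

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")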

Baseline Methods.
To validate the proposed model, the following baselines are used.
BERT-BiLSTM-CNN [27]: combines the advantages of BERT embeddings, BiLSTM, and TextCNN to capture local correlations and retain context information.
BERT-BiGRU-CNN: replaces the BiLSTM in BERT-BiLSTM-CNN with a BiGRU, which has a simpler structure and lower computational cost than BiLSTM.
BERT-BiLSTM-attention: a classification model based on BERT embeddings and a bidirectional LSTM combined with a self-attention mechanism.
BERT-(BiLSTM + CNN): a framework based on BERT embeddings and a double channel combining BiLSTM and CNN to capture local correlations and retain context information.
BERT-(BiLSTM-attention + CNN) [5]: an enhanced BERT-(BiLSTM + CNN) that adds a self-attention mechanism to the BiLSTM.
BiBERT-BiLSTM: our model, based on BERT embeddings and BiLSTM. BERT extracts word-level and sentence-level vectors from the input data and is embedded into the neural network architecture; the word-level vectors are then processed by BiLSTM and concatenated with the sentence-level vectors for emotion analysis.

Comparison with Baseline Methods.
The first experiment compares all the baseline methods on the MELD dataset. The results are shown in Table 1. We also evaluate the efficiency of all models: Table 2 compares BiBERT-BiLSTM with all baselines on model size (params) and complexity (giga floating-point operations). Because BiBERT-BiLSTM uses BERT to extract both word-level and sentence-level vectors, it requires roughly twice the computation of the other models, which use word-level vectors only. At the same time, the params of all models are broadly similar.

Ablation Study.
To comprehensively test the validity of the proposed method, we conduct an ablation study. Table 3 compares BiBERT-BiLSTM with the two main ablated baselines on the MELD dataset; BiBERT-BiLSTM outperforms both BERT-BiLSTM and BERT-Sentence. The BERT-BiLSTM baseline feeds only word-level vectors to BiLSTM to learn the context of the dialogue and discards the sentence-level vectors, while BERT-Sentence retains only the sentence-level vectors from the BERT embedding and discards the word-level vectors.
The experimental results of the ablation study on DailyDialog are shown in Figure 4. The testing accuracy of BiBERT-BiLSTM is 85.44%, the best result, outperforming BERT-BiLSTM by 0.13% and BERT-Sentence by 0.85%.

Discussion.
To ensure the validity of the experiments, the parameters are identical across all models in the simulation environment. Following existing research, the batch size is 32, the number of iterations is 10, the hidden dimension is 384, the learning rate is 0.0002, and the maximum sentence length is 100. The training loss obtained during experimentation is shown in Figure 5; from the simulation, the training loss stabilizes at about 0.2.
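These shared settings can be summarized as a configuration sketch (the variable names and the choice of Adam are our assumptions; the paper does not name its optimizer):

# Shared experimental settings reported above, wired into an optimizer.
import torch

config = {
    "batch_size": 32,
    "epochs": 10,
    "hidden_dim": 384,
    "learning_rate": 2e-4,
    "max_seq_length": 100,
    "dropout": 0.5,
}

model = torch.nn.Linear(10, 7)  # stand-in for the full BiBERT-BiLSTM model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=config["learning_rate"])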
The experimental results on two real dialogue datasets prove the validity of the proposed method. BERT acts as a feature extraction model that derives word-level and sentence-level vectors from the input data, and the proposed double-channel method performs better than the single-channel methods. The word-level vectors are enhanced by BiLSTM and then concatenated with the sentence-level vectors for emotion analysis. The experimental results show that BiBERT-BiLSTM is effective.

Conclusions
Many researchers have conducted emotion analysis with different models and provided various methods for emotion classification. Building on existing research, an architecture based on BERT embeddings and BiLSTM is proposed and constructed to determine emotion in dialogue. BERT extracts word-level and sentence-level vectors from the input data and is embedded into the neural network architecture; the word-level vectors are then processed by BiLSTM and concatenated with the sentence-level vectors for emotion analysis. To evaluate the proposed emotion classification model, experiments are conducted on two real dialogue datasets, and the results show that the proposed method significantly outperforms the baselines. In the future, we plan to extend the sentence-level vectors to improve the accuracy of the model. We also plan to apply the emotion analysis model in dialogue systems to generate emotional dialogue.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.